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In this work we consider Runge-Kutta discontinuous Galerkin methods (RKDG) 
k£ for the solution of hyperbolic equations enabling high order discretization in space 

^5 and time. We aim at an efficient implementation of DG for Euler equations on 

GPUs. A mesh curvature approach is presented for the proper resolution of the 
domain boundary. This approach is based on the linear elasticity equations and en- 
ables a boundary approximation with arbitrary, high order. In order to demonstrate 
the performance of the boundary curvature a massively parallel solver on graphics 
<^ processors is implemented and utilized for the solution of the Euler equations of 

^ | gas-dynamics. 

1 Introduction 

a 

i— i The DG method is based on a discontinuous finite element spatial discretization originally 

introduced by Reed and Hill in the early 1970s for the neutron transport equation pp. Later, 
J> in a series of papers [2j [SJ 01 [51 [6] , Cockburn and Shu combined the discontinuous Galerkin spa- 

tial discretization with an explicit Runge-Kutta time stepping (RKDG method) and extended 
the method to systems of conservation laws. This created the opportunity for highly parallel 
implementations, since in the RKDG method, one grid cell only needs information from the 
immediate neighbouring cells to march in time. Based on this, Biswas et al. investigated the 
potential of RKDG methods for parallelization in [7j. 

Another outstanding feature of DG methods is that the degree of basis functions can be chosen 
arbitrarily, thus leading to high order discretization. However, attention has to be payed to 
the representation of curved boundaries. Bassi and Rebay applied this method to the two 
dimensional Euler equations and worked out the importance of the boundary approximation 
with respect to the solution quality [8j. They pointed out that the order of the method is 
limited by the order of the boundary representation. Thus, it is necessary to deal with curved 
element discretizations, in order to overcome this issue. While this seems not to be challenging 
for some test cases in two dimensions it is quite difficult to approximate complex geometries 
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arising, e.g. , from industrial applications. 

We present a mesh curvature procedure based on linear elasticity deformations which matches 
a desired boundary shape. Similar techniques are well known for the deformation of discretiza- 
tion meshes and avoid expensive remeshing [9]. In order to deal with discontinuities arising in 
the solution of hyperbolic equations, slope limiters where introduced to the RKDG method by 
Cockburn and Shu. Successful in the finite volume community, limiters are yet complicated for 
higher order methods. This gives rise to the idea of adding artificial viscosity to the equations 
in order to smear out discontinuities and control the width of shocks. Persson and Peraire pro- 
posed a detector to apply artificial viscosity only in the vicinity of shocks [TO]. This approach 
seems very promising because of its flexibility, since there is no dependency on the order of the 
scheme or the geometry of the discretization elements like for slope limiters. 
Recently, there has been payed a lot of attention to discontinuous Galerkin methods because of 
their potential in terms of parallelization and high performance computing (HPC). Especially 
the nodal DG method proposed by Hesthaven and Warburton [11] was shown by Klockner, 
Warburton, Bridge and Hesthaven [12] to perform very efficiently on modern graphics proces- 
sors (GPUs) gaining high speedups compared to conventional codes. 

In this work, we present a novel approach combining the following aspects. We implement 
a high order Runge-Kutta discontinuous Galerkin method for the Euler equations of gas- 
dynamics on GPUs. For that, we propose an approach introducing unstructured, curved- 
element DG discretizations into a massively parallel GPU algorithm. Furthermore, we present 
a mesh curvature approach, which enables high order accurate boundary representations and 
seamlessly fits into the DG framework. In particular, we cover the whole simulation chain from 
the generation of curved, body-fitted meshes up to the parallel HPC system solution. Finally, 
we demonstrate the performance of this algorithm on some challenging transsonic test cases, 
where discontinuities arise in the solutions. 

This paper has the following structure. In section [2] the discontinuous Galerkin method is 
shortly introduced and it is shown how the discrete operators are composed. It is also covered 
how discontinuities in the solution are handled. In section [3j we describe an approach for deal- 
ing with curvature in the DG discretization. Section [4] shows the implementation on GPUs. 
Finally, in section [5] and [6] numerical results are presented and discussed. 



2 The discontinuous Galerkin method 

In this paper, we study a discontinuous Galerkin method for systems of hyperbolic conservation 
laws of the form 

^ + <^) + ... + a^) =0 

at oxi oxd 

U(0,x) = U (x) 

where P:RxK d 4l", U(t,x) = (Ui(t, x), . . . , U n (t, x)) T is the vector of conserved 
quantities at a point x in (i-dimensional space and at time t. Here O C H rf is the domain of 
interest and [0, T] a time interval. The vector fields Fi : W 1 — > W 1 in this system are usually 
referred to as flux vectors. 

For the purpose of this work we will maintain a more compact and widely used notation. 
Introducing the tensor F : 1" ->• R nxd , F(U) = {F^U) . . . F d (U)) in terms of the fluxes 
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Figure 1: Mapping from physical to reference element 



we then obtain 

d ^+V-F(U) = 0. (1) 

We follow the approaches in |13| and mostly use the notation therein. For the derivation of 
the discontinuous Galerkin method, we assume that the domain of interest Q is subdivided 
into a finite set of K disjoint, conforming elements = Ufc=i ^k- I n our approach this is a 
tetrahedral mesh with curved elements. The solution is then approximated using a space Vh 
of element-wise defined polynomials ipj up to degree p 

K 

V h = Vl V k h = span{^ (J2 fc ) ,j = l,...,N p }. 

k=l 

Here, N p = ^jj^p is the number of basis functions depending on the desired degree p and the 
spatial dimension d. Multiplying with a test function <3? from the same space and integration 
over the element fifc leads to 



$dn = o. 



Integration by parts then yields the weak discontinuous Galerkin formulation 

J dQ = - J (F{U)* -7t)<S>dS + J F(U) • VM1 (2) 

Here, ~n denotes the outward pointing normal vector and F* an approximate Riemann solver, 
which deals with the double-valued state at the element interfaces, e.g. upwinding. These 
approximate Riemann solvers are well known from finite volume context and a survey can be 
found in [14J. For the purpose of this work, we tested both a local Lax-Friedrichs and a HLLC 
Riemann solver but without noticing major differences. 

In the following, a discrete version of equation ^ is derived, which can be implemented on 
a computer. In order to calculate the integrals occurring in ^ we have to establish a link 
between an arbitrary curved element $7^ in the mesh and the reference tetrahedron T = {—1 < 
r,s,t < 1, r + s + t < —1} on which cubature/quadrature rules are known. This is done by 
mapping functions ^ : — > T for each element as illustrated in figure [T] connecting the 
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physical coordinates x = (x, y, z) to the computational ones r = (r, s, t). Moreover, the partial 
derivatives of ^ k have to be known since integration for any functions f,g : £2^ — > W 1 in the 
physical space is evaluated as 



/ 



n k T=<n k (n k ) 

with the convention 



fgdn= J f(^ k )g^ k )\det(D^ k )\dT 



D^> = [ s x s y s z and J = |det (DV)\ . (3) 

\t X ty t z J 

Hence, it is sufficient to derive the local operators for the reference element and transform 
them into the physical space. Then, we introduce a orthonormal, hierarchical set of modal 
basis functions = 1, . . . , N p } on the reference element T ■ An arbitrary function / on T 

is then interpolated on a given set of collocation points {r«, i = 1, . . . , N p } (figure [2^i) as 

N p 

f(r i ) = J^f j ^ j (r i ), Vi = l,...,N p , 

j=0 

with modal expansion coefficients / = (/i, . . . , fN p ) T - The interpolation can be reformulated 
in terms of a multivariate Lagrange polynomial basis 

N p 

f(r i ) = Y / f(r i )l j (r i ), Vt = l,...,JVp 

j=0 

with nodal values / = (/(^i), • ■ ■ , f(fN p )) T ■ These grid points are chosen according to [15] to 
ensure good interpolation properties. Introducing the Vandermonde matrix Vij = ipj{ri) we 
obtain that these two formulations are linked as 

/ = Vf. 

Since there are as many basis polynomials as collocation points and depending on the inter- 
polation quality V is nonsingular. Now, we are prepared to define the local operators for the 
integration in For that purpose, we distinguish three sets of nodes: the collocation points 
ri as introduced above and two sets of nodes used for integration. These are on the one hand 
cubature points {r? ub ,i = 1, . . . , iV cu b} (figure^) for volume integrals [16] and on the other 
Gauss quadrature points {rf , i = 1, . . . , N g } (figure^) for surface integrals [T7j . 
For the derivation of the discrete operators, we closely follow |13j . First, in order to interpolate 
the solution given at the nodal points to quadrature points r™ b and rf we need the following 
interpolation matrices 

Icub = VcubV- 1 € R^cubxtf, ( Zg = VgV 1 G R N * xN r 

where (V cu b)ij = 4>j ( r f ub ) and (^c?)^- = V ; j( r f)- Multiplying the vector of nodal unknown 
values with these matrices first transforms them to modal expansion coefficients and then 
evaluates the modal basis functions at the cubature points. 
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(a) Collocation points n (b) Volume cubature points r\ nh (c) Surface quadrature points rf 

Figure 2: Collocation and quadrature points 

For the construction of the stiffness matrices we need to evaluate derivatives like ^ at the 
cubature points. Here we introduce the operator 

D r = VrV' 1 e R n ^ xN p 



where (V r 



r JV ~ dr 



. The s and t derivatives are analogous. 



„cub 

i 

With this we are now prepared to assemble the local stiffness matrix for one specific element 
leaving out the element number k for clarity 

S x = {pj • diag(r x , i ) + D T S • diag(^) + Dj • diag^)) • diag( J^ cub ) G R N ^ N ^ (4) 

where W^™ 13 are cubature weights. S y and S z are obtained in the same way. The local mass 
matrices are then given by 

M = lj nh ■ diag( JiWr h ) ■ X cub E M***"* (5) 

Finally, we obtain the local face mass matrices 

M m = Tj • diag^Wf ) G R N " xN v (6) 

where Wf are the weights of a 2d Gauss integration rule for triangles. Plugging these operators 
into equation ^ yields the semi-discrete system 

^ = £ S KXm F m (uf°) ~ M^M dnk (F (I!*)* ■ It) . (7) 

m=l 

In the equation above, U^ uh = I C ubUh denote the degrees of freedom associated with the 
cubature points used for the volume integration (figure [2jb) and t/| = T g Uh the ones for 
surface integration (figure [2b) respectively. Since the operators are only local and therefore 
represented by small matrices, a LU or Cholesky decomposition of can be calculated in an 
initial step of the solver. 

Finally, the system is discretized in time using a fourth-order accurate Runge-Kutta pseudo 
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time stepping consisting of five stages. We use a low storage version of this scheme as described 
in |18| which does not require to keep the intermediate stages in memory. Furthermore, the 
application of an explicit time integration like this RK method does not require to set up a 
global matrix system. Thus, the method above is a so called matrix-free approach which is 
especially in three dimensions very attractive with respect to memory requirements. 
When dealing with fluid dynamics (see section [5]), the method described above works reliably 
for low Mach numbers. Yet, for higher Mach numbers, shocks might appear in the solution 
leading to strong non-physical oscillations. This phenomenon is well understood and can be 
overcome by adding a small amount of artificial viscosity to the equations as proposed by 
Persson and Peraire in |10j . 

However, in regions where the solution is smooth there is no need for stabilization. Hence, 
these authors compare the approximated solution of one component to the solution 
where they drop the modes with the highest frequencies 

N p N p -i 

J=l 3=1 
with the following smoothness indicator 

fn k (uh - Uh) • (uh - Uh) dtt 
f Uk Uh ■ u h dtt 

In smooth regions of the solution, this indicator will be close to zero whereas it increases in 
regions with high frequencies. The amount of artificial viscosity in element k is then determined 
as 



if Sk < so — k 

f (l + sm ^X^ ) if S -K<S k <S + K 



e if s fc > s + k 



with Sk = log 10 (Sk), so ~ logio (A ) an< ^ empirically chosen parameters k and eo- Thus, by 
applying a shock detector it can be decided where shocks arise and where to add viscosity. 
The modified system of equation is then given by 

^ + V-F(U) = V-(eVU). 

This can be rewritten as a system of first order equations and discretized using the same DG 
approach 

dU 

— + V • F(U) -V-^eq = 
q - VeW = 0. 

Through this procedure the semi-discrete system ([7]) is expanded by an additional equation 
and the viscous fluxes yielding (c.f. |19j ) 

Mk 1 E Sk, Xm [F m (U^ h ) - v^C b " M^M g [(F [UlY - (v^f)*) ■ r*] 



^ = Mr 1 V _ \jr- ( TT?»A _ . /7^ub 
dt 

3 



m=l 
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Figure 3: Deformation to a sphere 

3 Mesh generation for DG 

As pointed out in the introduction, a remarkable feature of the DG method is the freedom 
to choose the degree of basis functions. However, because of that the mesh has to be dealt 
with caution. It is tempting to reuse the meshes from the finite volume community. Yet, this 
leads to major problems, since the discretization with straight-sided elements leads to kinks 
at the boundary walls, which is problematic whenever the fluid is interacting with curved 
geometries. In this situation the numerical solution might contain small non-physical shocks 
in each element at the boundary. In principle, this could be overcome by using a very fine 
discretization. However, this is conceptually in contrast to higher order discretization methods. 
Thus, it is necessary to introduce curved elements into the discontinuous Galerkin discretization 
in order to enable a higher order boundary representation. 

An isoparametric mapping is applied for the realization of the element curvature. Assuming 
that the deformation of the elements can be represented in the same Lagrange polynomial 
basis as used for the DG scheme, we only need to know the displacement of the collocation 
points with respect to the ones in the straight sided element. Thus, it is not enough to have 
a functional description of the curved boundary. Additionally, the locations of the collocation 
points have to be corrected such that they do not fall out of the element. Moreover, this 
displacement procedure should work smoothly, in order not to perturb the integration. 
In two dimensions this process is simple. Here, only the elements sharing a face with the 
boundary wall are affected. By ensuring that the collocation points at the two other faces 
remain on their initial position, the neighboring elements remain untouched, even if they share 
one vertex with the boundary. Obviously, this does not hold for the three dimensional case. 
By deforming a face on the boundary the other three faces in general have to be curved as 
well. This forces a whole layer of cells around the wall to be curved. 

In the following, we focus on how to transport the curvature information of the boundary into 
the mesh. As mentioned before, in three dimensions a lot of cells are affected. Hence, we 
suggest instead of tracking these cells to put a bounding box around the geometry marking an 
area in which cells get curved. We follow the ideas in [9] in order to model the discretization 
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Figure 4: Leading edge of ONERA M6 airfoil surface - initial and curved geometry 



mesh as embedded in some flexible material and solve the linear elasticity equations with a 
finite element method. By this, the curvature information of the boundary is transported into 
the mesh and can be retrieved at arbitrary points. 

In order to keep this process as cheap as possible, we extract the cells which should become 
curved from the CFD-mesh and solve the linear elasticity equation on the resulting sub-mesh. 
The governing equations of the linear elasticity are given by 

V • a = / onfi c . 

Here / denotes body forces acting on the solid, which will not be present in our case. f2 c C $1 
is the region of the DG mesh containing the cells to be curved. Finally, a denotes the stress 
tensor and is defined by the strain tensor e as 

a = ATr(e)J + 2/xe , e = - (Vit + Vu T ) . 

In the equations above, A and [i are the Lame parameters, which can be expressed in terms of 
the Young's modulus E and Poisson's ratio v 

vE E 
(l + v)(l-2v) ' ^~ 2(1 + 1/)' 

The vector-valued function u : Q c — > M 3 describes the displacement which the material would 
perform under the prescribed forces. In our case, it is not important whether the chosen 
coefficients represent a physical material. Nevertheless, it is rather crucial to know how they 
act on the solution. Young's modulus can be seen as a measure for the stiffness and Poisson's 
ratio describes how much a material expands in two coordinate directions when compressed in 
the third. Since / = in our case the solution does not depend on the stiffness parameter E. 
Thus, the mesh deformation is controlled by v only. This parameter is then chosen depending 
on the situation. If we want to compute the flow past a sphere, like illustrated in figure [3j 
we apply a rubber-like material (y ~ 0.5) in order to obtain a smooth deformation, which 
propagates in each coordinate direction. In the NACA0012 case, which is a 3d staggered 
version of the 2d test case, we choose v close to in order to avoid deformation into the third 
dimension. 

This system is then completed by the boundary conditions 

onToi, (8) 
onr D2 , (9) 

onTiv. (10) 



u = g 
u = 
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Figure 5: Tip of NACA0012 profile - initial and curved grid 



We choose the boundaries such that T^i is the objects surface. Td2 is the bounding box defin- 
ing the sub- mesh for the elasticity equation. Finally, Ttv is used to model symmetry walls, 
where collocation points are allowed to slide but may not leave the plane. 
One great advantage of this approach is the variable order of boundary representation. Since, 
we are free to choose the degree of the finite element basis functions, we can fit the boundary 
representation to the order of the DG scheme. 

Moreover, in most situations this approach avoids singularities due to the deformation. This 
means, if the gap between the straight-sided mesh and the curved boundary is not to large, we 
can expect a feasible deformation of the mesh without overlappings or zero- volume cells. 
However, this approach requires a parameterized representation of the surface, since the dis- 
placements at the boundary have to be evaluated at arbitrary coordinates. While this seems 
not to be a problem for many geometries existing as CAD models, we are also dealing with 
geometries only available as a set of vertices. Yet, there is a lot of literature dealing with the 
problem how to find a NURB surface representation to a given set of points. 
For a parameterized surface it must be decided how the straight-sided mesh must be mapped 
to fit the curved geometry. For this purpose a closest point problem has to be solved 



where S(-,-) denotes a NURB surface and x is a point on the straight-sided boundary. We 
solve this problem by the Gauss- Newton algorithm. Let (a*, (3*) be the optimal solution, then 
g(x) = S(a*,{3*) — x is plugged into the boundary condition on r^i. 

With these boundary conditions we are now able to apply a high order finite element solver for 
the linear elasticity equation. For this purpose we use the GETFEM finite element toolbox. 
Then, we store the solution obtained together with the CFD-mesh and use it later in the 
discontinuous Galerkin solver in order to retrieve mesh displacement information at arbitrary 
locations. 





s.t. 0<a,/3<l 
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4 Application on GPUs 



Reviewing the DG method described above, it turns out that there is a lot of parallelism in 
the executions. On the coarsest level one can proceed as in other discretization techniques like 
finite volume methods and partition the mesh. Due to the nature of DG this does not lead to 
much data transfer between the compute nodes since only values at the element faces have to 
be communicated. 

The same holds on a second level of parallelism. As seen in the derivation of the DG method in 
section [2j there is only a loose coupling between elements. Most operations can be performed 
independently. Just the evaluation of the surface integral requires the nodal values on the face 
of the neighbouring element to be known. 

Furthermore, we can distinguish a third level of parallelism, the most interesting one. Both 
on the volume and on the surface quadrature nodes, the nonlinear flux function F has to 
be computed, which is a very expensive operation. Not only the high order basis functions 
but also the rational Jacobian of the non-affine mapping force us to apply a very high order 
quadrature rule leading to a huge number of quadrature points. However, these operations are 
also independent per node. 

These features suggest the application of graphics processing units, i.e. GPUs, to the DG 
method, since the design considerations above perfectly fit into the CUDA hardware model. 
Originally designed to render many geometrical primitives or pixels in parallel the Nvidia Fermi 
architecture nowadays is based on a set of 16 so called streaming multiprocessors (SM). Each 
of them features 32 cores which leads to a total number of 512 CUDA cores enabling hundreds 
of floating point operations at a time. 

From the programming point of view, CUDA organizes threads in a hierarchical structure. 
On the lowest level there are the CUDA threads each having its own registers for temporary 
variables. These threads are then gathered in thread blocks and identified through a three 
dimensional index. Each of these blocks comes with 64 kb shared memory in which data can 
be collected and distributed between the threads. One major advantage of this concept is that 
shared memory is on chip. Thus, data which is used multiple times or has to be shared between 
threads has to be requested from the global RAM of the GPU only once. 
Finally, on the top level blocks are organized in a grid structure and again identified through 
three dimensional indices. 

For further details we refer to the CUDA documentation (20]. Yet, one crucial aspect should be 
addressed. In most cases the performance gain of a massively parallel GPU algorithm strongly 
depends on memory fetching strategies. Generally, whenever the threads of a block {to, . . . , t n } 
fetch data from an array A = {ao, • ■ • , cln} out of the global RAM of the GPU, these accesses 
should be organized in a blockwise fashion, where the blocksize is a multiple of 16. The threads 
to,...,t n should access successive data elements - each thread one element - where the first 
index is a multiple of 16. Thus, memory accesses into the ith block should be organized as 

~* a (i6i+j) > J G {0, . . . , n} 

where r is a permutation on {0, . . . , n}. In this case, multiple memory accesses can be per- 
formed at once leading to so called coalesced accesses. Whereas, when a thread block requests 
data which is scattered throughout the memory this ends up in serialization of the accesses. 
This issue is crucial particularly for older GPU hardware where finite volume methods (FVM) 
have been implemented for fluid dynamics. Since the FVM can be seen as a special case of 
the DG method for p = 0, there is only one unknown per cell. All neighbouring variables are 
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involved in the update of the value of this unknown, which leads to a high ratio of scattered 
data accesses in case of unstructured meshes. While GPU acceleration only seemed to work on 
structured meshes in FVM, the discontinuous Galerkin method overcomes this problem. Due 
to the higher order of the method, there are plenty of values per cell which can be organized 
successively in memory allowing high speedups even on unstructured meshes. 
This hardware and programming layout can now be identified with the basic steps of the 
DG method. As previously mentioned, operations on one element are mostly independent of 
the executions in other elements. Thus, it is obvious to assign one CUDA block to each DG 
element. Then, the arithmetically expensive nonlinear flux evaluations can be performed in 
parallel. Moreover, the subsequent matrix vector products can also be handled very efficiently 
within each block since we are dealing with small, dense matrices. Finally, the CUDA exe- 
cution handler decides how many thread blocks can be executed in parallel depending on the 
order of the DG method. 

We have to reorganize the unknown values in order to meet the memory pattern of the GPU. 
The length of each field should be a multiple of 16 which is achieved by so-called zero padding 
as demonstrated in (12). Here, f7* denotes the unknown values inside element k. As illustrated 



in chapter [2] we are dealing with a vector of n unknowns interpolated at N p collocation points. 
Thus, the padding length is determined such that the block length is enlarged from N p to 
bp = \ jVp 1 ^ 15 ] and we ensure that the start address of each field is a multiple of 16 



.., 0, 
.., 0, 



jjn Tjn 



TJ 2 

N v —li 



, 



(12) 



padding 



The same works for the unknowns at the quadrature points. In contrast, the values at the 
cubature points are only used once and do not need to get shared with other elements. Thus, 
we do not need to store them in the global RAM. 

In the following, we will consider the element specific volume integral in equation ^ as an 
example. For the integrations inside each element, we also need to fetch the element specific 
local operators from the global GPU RAM. Thus, they are stored in a similar way like the 
vector of unknowns. For example the S x operator in equation Q is stored as 



'(0,0)' 

-.x 

'(0,1)' 



'(1,0)' 



'(JVp-1,0)' 



S (l,l)' •••' S (Af p -l,l)' 



0, 0, 
0, 0, 



'(0,iV cu b-l)> S (l,AT cub -l)> ■■•' S (AT p -l,7V cub -l)> U ' ••■' U 



(13) 



padding 
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Figure 6: Algorithm execution on one compute node 



In order to store the interpolation matrix I cu b the block length has to be as 6 cu b = [ jVcub+15 
leading to 



16 



-Z^cub 



''(O.O); t (l,0)> 
i (0,l)' t (b 1 )' 







(14) 



padding 



Note that these operators are stored in column- major form, which is important for the matrix- 
vector product. 

With this data structures in global GPU RAM we implement the CUDA grid to have K 
thread blocks, which will be associated with the DG elements. Each of these blocks should 
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contain -/V cu b threads {to, . . . , t7v cub -i} and a matrix U[n] [iV cu b] in shared memory to cache the 
unknown values. At this point, we can assume that the matrix U is large enough to store both 
Uh and C/cub since iV cu b > N p . This holds, because for a proper integration of the nonlinear 
flux function there have to be more cubature points than collocation points. The initial step 
is to cache the nodal values Uh into the shared matrix. For that purpose the first N p threads 
load and store the n fields of Uh into U. Then the product I cu bU^ is evaluated column-wise. 
The threads to, . . . ,tN cub -i load one column of Z cu b, each thread one value. Since Z cu b is stored 
in column-major form, these loads are coalescent. Then the n values in U corresponding to 
this column are broadcasted to all threads and the multiplication is performed. The broadcast 
operation is very efficient, since the matrix U was fetched to the shared memory which is on 
chip in contrast to the global GPU RAM. This is done successively for all columns and each 
thread is tracking the sum over all columns in its registers. Finally, by this procedure, we 
obtain the nodal values interpolated to the cubature points U^ nh , which are again stored in U. 
Through this procedure the dot products between the iV cu b rows of X cu b and the fields of U 
are performed in parallel each by one CUDA thread. 

In order to calculate the volume integral in ^ which is approximated in the discrete system 
|7| by Ylm=l $k,x m F (f^ ub ) the threads to, ■ ■ ■ , tjv cub _i evaluate the flux function F on U in 
parallel. Finally, the multiplication with the three stiffness matrices S Xl , S X2 , S Xa is treated 
as described above. However, this matrix vector product can not be handled as one thread 
per output value, since S Xi has more columns than rows in general. Thus, the threads are 
distributed over several matrix columns, which are then handled in parallel to maintain the 
occupancy of the streaming processors high. 

For this implementation, we were inspired by the MIDG code |21| . MIDG is a very lightweight 
discontinuous Galerkin solver for Maxwell's equations designed to run on multiple GPUs. A 
mesh partitioning with ParMetis is involved which minimizes the cuts between processes and 
thus reduces communication. The time discretization is realized by a low storage Runge-Kutta 
scheme as described in section [2] For the spatial discretization, the nodal DG scheme as 
described in [13] is applied. We mostly adopted the parallel framework of this code including 
the mesh partitioning and the MPI communications. However, the quadrature free approach 
therein does not support curved elements and leads to errors when dealing with nonlinear 
PDEs like the Euler equations (cf. section [5]). We thus had to apply a computationally more 
expensive, quadrature based scheme as introduced in section [2} 



p 


CPU 


GPU 


GPU (float) 


speedup 


speedup (float) 


4 


278.92 s 


54.16 s 


15.27 s 


5.15 


18.27 


3 


92.88 s 


21.34 s 


7.96 s 


4.35 


11.67 


2 


34.01 s 


8.40 s 


3.52 s 


4.05 


9.66 



Table 1: Run time of the time stepping loop, 200 iterations on 25956 cells 

An overview of our algorithm is given in figure [6} We work on a compute cluster - shared or 
distributed memory - on which each node is equipped with one GPU. Here, the executions on 
compute node j are described and it can be seen that after the interpolation to the quadrature 
nodes on the element faces the communication with other processes takes place. For that 
purpose, the values on processor cuts are downloaded from the GPU and distributed to other 
compute nodes. In order to save time, this is done asynchronously, while the volume integration 
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is executed. 

In our test setting, we have a shared memory system equipped with 8 Intel Xeon E5620 CPU 
cores and 8 Nvidia Tesla M2050 GPUs. For measurements of the GPU accelerated code against 
a conventional CPU code, we run a test on one GPU and compare it to a CPU code where the 
matrix vector multiplication is implemented using BLAS routines. Since the underlying CPU 
has four cores, we use an OPENMP parallelization to achieve full utilization of its capacity. 
For that purpose we hint the compiler to apply a parallelization of the k-loop over all elements 
(cf. equation Q). In the test setting of table [T] we achieved a run time on one CPU core of 
1016.36 s for p = 4 elements. Compared to 278.92 s for the parallel version this reflects the 
theoretical speedup on this hardware architecture. Furthermore, table [T] shows the run times 
for 200 Runge-Kutta iterations on a mesh with 25956 cells for the CPU and GPU code. Here, 
the most interesting polynomial degrees with respect to CFD are inspected. For the GPU run 
times we distinguish a single precision (float) and a double precision implementation. The 
last two columns of table [T] show the gained speedup compared to the CPU code. We observe 
that the speedup increases with the polynomial order. This stems from the higher number of 
quadrature points in one element leading to a better utilization of the available CUDA threads. 
Overall, on the most interesting level for this paper of p = 4 elements, speedup factors of 5 up 
to 18 for single and double precision are achieved. 



5 Numerical Results 



As a special case of equation ([I]) we consider the Euler equations of gas-dynamics in three 
spatial dimensions. These equations describe the motion of an inviscid fluid without heat 
conduction 
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Here, the unknown function reads as 
[/:Rxl 3 ^l 5 , U(t,x) = 



(p(t,x), pu(t,x), pv(t,x), pw(t,x), pE(t,x)) J 



where p denotes the density, u, v, w the fluid velocities in the three space directions and E the 
total energy. Finally, the system is completed by the perfect gas law for the pressure 



p = (7 - 1) [pE 



u 2 + v 2 + w 2 
where 7 is the adiabatic index of the fluid [22]. 

Our first test case is a subsonic flow past a sphere at =0.38 which was examined together 
with the discontinuous Galerkin method in j8]. This test case is very attractive, since the 
surface curvature is straightforward. Nevertheless, we apply our curvature approach to obtain 
a smooth volume mesh. Let r denote the radius and xo the centroid of the sphere. Then the 
boundary conditions in equation (TsT) for the mesh deformation are given by 



g(x) = x + 



;X 
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(a) Disturbed Solution on straight sided elements (b) Isoparametric curved elements 



Figure 7: Density distribution of a subsonic flow past a sphere with p = 4 basis functions 



As shown in figure [3j we start with a straight sided grid and solve the linear elasticity equation 
on the surrounding volume grid with p = 4 basis functions. Thus, we obtain a smoothly curved 
volume discretization, which we then use as input for the DG Euler solver also based on p = 4 
basis functions. We note that best results are obtained when the order of the linear elasticity 
solution matches the order of the DG solver. In this case the deformation lies within the range 
of the DG basis functions and can be resolved properly. 

Figure [7] shows the effect of curved boundaries on the solution quality. On the left hand side the 
solution is disturbed due to straight sided elements in the discretization mesh. This leads to 
small yet problematic kinks between the elements on the surface and results in a non-physical 
solution. In contrast, the computations on the right hand side were performed on a curved 
grid, which is in addition coarser. In this case the linear elasticity deformation was calculated 
using polynomials of order four. 

The next test case is the NACA0012 symmetric airfoil. Although the geometry is only two 
dimensional, we stretch the profile into the third coordinate direction in order to apply the 
same solver as for the other test cases. Here, the effect of the boundary deformation on the 
volume mesh can be visualized as figure [5] shows. 

In the NACA0012 situation we consider two flow conditions. First, a pure subsonic flow with 
no angle of attack and Mach number M m = 0.4 (figure^). Figure [8)3 shows a = 0.8 
transonic flow with angle of attack a = 1.25°. In both pictures, the density distribution is 
plotted. Since we are dealing with discontinuities in this situation, we have to add artificial 
viscosity to the system of equations in order to ensure stability of the method. For that goal, 
we proceed as described in section [2] and apply a shock detector to select the troubled cells. 
In the following we consider the Onera M6 wing. This test case is well known and examined 
with a variety of numerical methods. The original experiment was defined in [23] at Mach 
number Moo = 0.8395 and angle of attack a = 3.06°. Again, we start with a straight-sided, 
tetrahedral mesh and a NURB representation of the ONERA M6 geometry and use the defor- 
mation approach to obtain the curved mesh (figure [4]) . Figure [9] shows the density distribution 
on the upper surface of the airfoil. Also in this case, artificial viscosity is applied to the system 
to deal with the shocks on the upper surface of the airfoil. In both the NACA0012 and ONERA 
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(a) Subsonic flow, ¥„ = 0.4, a = 0° (b) Transsonic flow, M x = 0.8, a = 1.25° 

Figure 8: Flow past NACA0012 airfoil with p = 4 elements 

M6 situation we set the parameter for the viscosity k = 4 and eo = 0.3. 

For this computations we utilize a so-called p-refinement as a acceleration technique. This 
means that we start the time stepping with low order polynomials of degree p\, e.g. pi = 1, 
iterate until the Runge-Kutta update is below a specified tolerance and then switch to a richer 
polynomial space of degree p2 = Pi + 1. This process is repeated until the desired order is 
reached. The transfer operation between the coarser and the finer polynomial space is straight- 
forward, since we are dealing with a hierarchical set of basis functions as introduced in section 
[2] Thus, the solution on level pi can be directly embedded into the pi+\ space, where the 
higher order polynomials are initially weighted with zeros. 

This procedure offers two advantages in terms of acceleration. Firstly, computations are much 
cheaper on a coarser level of p, since the number of unknown values is smaller. Furthermore, 
the cubature and quadrature formulas do not have to be as precise as for higher order poly- 
nomials, which leads to smaller local operators. Secondly, the timestep for the Runge-Kutta 
time integration scales with 0(p~ 2 ), c.f. [13]. This enables a much larger timestep for smaller 
values of p, which reduces the number of iterations needed to reach the steady state solution. 
For the computations shown in this work, we started with p = 2 basis functions and subse- 
quently refined to p = 3 and p = 4. We found out that this is a convenient choice, since we 
are dealing with relatively coarse grids, where computations with p = 1 or even p = do not 
lead to suitable results. 
Table [2] shows two different algorithm executions for the sphere test case. The underlying 
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timestep 
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timestep 


time consumed 
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8000 


6e-5 


105.8 s 
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10000 


2e-5 
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5e-6 


650.0 s 
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61000 


5e-6 


1982 s 
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61000 




1982 s 



Table 2: Algorithm execution for the sphere - with and without p-refinement 
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rho 




0,65 



Figure 9: Flow over Onera M6 airfoil with p = 4 elements 

discretization mesh consists of 25956 tetrahedra and 4457 of those are curved elements. More- 
over, on the finest level, each element contains 35 collocation nodes, 70 cubature nodes and 
4 • 16 surface quadrature nodes. On the left hand side, the simulation was started with p = 2 
basis functions and then subsequently refined to p = 3 and p = 4 functions. Here, the number 
of iterations on level 2 and 3 was chosen empirically. In contrast, on the right hand side, 
the simulation was run using p = 4 basis functions only. Both executions terminated when 
the infinity norm of the residual for p = 4 was below a tolerance of le-9. For this particular 
case, the execution time is more then halved by the p-refinement procedure. The number of 
iterations on each level may look surprising, since it is multiple of 1000. This stems from the 
fact that the calculation of a norm is an expensive operation in parallel. In contrast, a few 
extra Runge-Kutta evaluations are relatively cheap compared to the parallel reduction for the 
norm. Thus, we evaluate the norm of the residual only after multiples of 1000 iterations. 

6 Conclusion 

In this work, we have presented a curved mesh generation approach. This approach is based 
on the linear elasticity deformation of an initial grid with straight sided elements until it meets 
the desired curved boundary. We demonstrated that the boundary approximation can be of 
arbitrary high order reducing the error induced into the discontinuous Galerkin discretization 
due to under-resolved geometries. The solution of the linear elasticity equations with a high 
order finite element method seems computationally very expensive at a first glance. However, 
this has to be done only once. We precompute the curvature and the DG solver only needs to 
evaluate the FEM-solution at the desired nodes. 

In a second step, we showed how these curved meshes are embedded into a GPU based parallel 
DG solution of the Euler equations of gas-dynamics. Furthermore, the performance of this 
massively parallel GPU code was tested where we gained a speedup of up to 18 compared to 
the serial version of this code. We have chosen some challenging test cases including transonic 
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flows leading to discontinuities which were treated with an artificial viscosity approach. 
Although the combination of an explicit Runge-Kutta time integration with diffusion terms 
leads to very small time steps we have seen that the outstanding parallel performance of the 
GPU overcomes this issue. And we believe that this fact justifies the additional implementation 
effort for the GPU code. 
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